The dataset was retrieved from Kaggle: Hourly energy demand generation and weather. This dataset contains 4 years of electrical consumption, generation, and pricing for Spain.
Description of the Problem
k-Nearest Neighbors (kNN), Linear Regression (including L1- and L2-regularized variants), Decision Tree, Random Forest, and Gradient Boosting methods (along with their variations) are models that inherently struggle with extrapolation, particularly in the context of time series forecasting. This study aims to develop a specialized Data Preprocessing and Feature Engineering Pipeline tailored to time series challenges, with a focus on forecasting electricity prices. Our objective is to enhance the models' ability to interpret and predict time-based data through appropriate feature engineering and preprocessing techniques. In this case, the goal is to predict the hourly prices for the next 21 days.
Highlights
This study delves into solving Time Series Machine Learning issues, focusing on Feature Engineering and Data Preprocessing for such problems. Key aspects include:
- DateTime Extraction: Extracting and utilizing date-time data.
- Lag Variable Integration: Incorporating lag variables for better predictions.
The study emphasizes Mean Absolute Error (MAE) as the key metric for evaluating Time Series models, since it directly measures the average magnitude of prediction errors.
Methodology: The Machine Learning Pipeline
Data Exploration
The data was explored in terms of how the features initially behave, how strongly the independent variables correlate with each other, and how they correlate with the target variable. This section emphasizes the use of a Pearson Correlation Coefficient Heatmap to surface potential multicollinearity.
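The exploration cell itself is not shown in this export; a minimal sketch of the correlation step might look like the following, where computing over numeric columns only also avoids the deprecation warning pandas emits when non-numeric columns are present:

```python
import pandas as pd

def correlation_heatmap(df: pd.DataFrame) -> pd.DataFrame:
    # Pearson correlation over numeric columns only; the resulting matrix
    # can be rendered with e.g. seaborn.heatmap(correlation_matrix).
    correlation_matrix = df.corr(numeric_only=True)
    return correlation_matrix
```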
Data Preprocessing: Manual
Executed alongside the Data Exploration, this section covers the assessment of NaN values and of potential categorical variables. Unlike the Feature Engineering in the next section, the Data Preprocessing was not automated (i.e., no functions were created for it).
Feature Engineering: Automated
This section tackles the core of the study by integrating variables that are crucial to solving Time Series problems. The column containing the date information was assessed and then decomposed into year, month, day, hour, and weekday features.
This section also covers the integration of Lag Variables. Lag Variables let the model use the immediately preceding data points, improving the predictability of the next data point.
Subjecting the Final Dataset to the Models
Eight models were used for the study, namely:
- k-Nearest Neighbors (kNN)
- Linear Regression
- Linear Regression with L1 Regularization
- Linear Regression with L2 Regularization
- Decision Tree
- Random Forest
- Gradient Boosting Method
- XGBoost
To demonstrate the effectiveness of Feature Engineering and meticulous Data Preprocessing, the hyperparameters of all models were left at their defaults.
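Setting these up with default hyperparameters might look like the following sketch (the dictionary keys are illustrative; XGBoost is guarded with a try/except since it ships separately from scikit-learn):

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# All estimators on scikit-learn defaults, as described in the text.
models = {
    "kNN": KNeighborsRegressor(),
    "Linear Regression": LinearRegression(),
    "Linear Regression + Lasso": Lasso(),
    "Linear Regression + Ridge": Ridge(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "Gradient Boosting Method": GradientBoostingRegressor(),
}
try:
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor()
except ImportError:
    pass  # XGBoost is optional; the other seven models still run
```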
Execution of Methodology
Data Exploration
The dataset consists of 35,064 rows and 29 columns.
| generation biomass | generation fossil brown coal/lignite | generation fossil coal-derived gas | generation fossil gas | generation fossil hard coal | generation fossil oil | generation fossil oil shale | generation fossil peat | generation geothermal | generation hydro pumped storage aggregated | ... | generation waste | generation wind offshore | generation wind onshore | forecast solar day ahead | forecast wind offshore eday ahead | forecast wind onshore day ahead | total load forecast | total load actual | price day ahead | price actual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 35045.000000 | 35046.000000 | 35046.0 | 35046.000000 | 35046.000000 | 35045.000000 | 35046.0 | 35046.0 | 35046.0 | 0.0 | ... | 35045.000000 | 35046.0 | 35046.000000 | 35064.000000 | 0.0 | 35064.000000 | 35064.000000 | 35028.000000 | 35064.000000 | 35064.000000 |
| mean | 383.513540 | 448.059208 | 0.0 | 5622.737488 | 4256.065742 | 298.319789 | 0.0 | 0.0 | 0.0 | NaN | ... | 269.452133 | 0.0 | 5464.479769 | 1439.066735 | NaN | 5471.216689 | 28712.129962 | 28696.939905 | 49.874341 | 57.884023 |
| std | 85.353943 | 354.568590 | 0.0 | 2201.830478 | 1961.601013 | 52.520673 | 0.0 | 0.0 | 0.0 | NaN | ... | 50.195536 | 0.0 | 3213.691587 | 1677.703355 | NaN | 3176.312853 | 4594.100854 | 4574.987950 | 14.618900 | 14.204083 |
| min | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | NaN | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | NaN | 237.000000 | 18105.000000 | 18041.000000 | 2.060000 | 9.330000 |
| 25% | 333.000000 | 0.000000 | 0.0 | 4126.000000 | 2527.000000 | 263.000000 | 0.0 | 0.0 | 0.0 | NaN | ... | 240.000000 | 0.0 | 2933.000000 | 69.000000 | NaN | 2979.000000 | 24793.750000 | 24807.750000 | 41.490000 | 49.347500 |
| 50% | 367.000000 | 509.000000 | 0.0 | 4969.000000 | 4474.000000 | 300.000000 | 0.0 | 0.0 | 0.0 | NaN | ... | 279.000000 | 0.0 | 4849.000000 | 576.000000 | NaN | 4855.000000 | 28906.000000 | 28901.000000 | 50.520000 | 58.020000 |
| 75% | 433.000000 | 757.000000 | 0.0 | 6429.000000 | 5838.750000 | 330.000000 | 0.0 | 0.0 | 0.0 | NaN | ... | 310.000000 | 0.0 | 7398.000000 | 2636.000000 | NaN | 7353.000000 | 32263.250000 | 32192.000000 | 60.530000 | 68.010000 |
| max | 592.000000 | 999.000000 | 0.0 | 20034.000000 | 8359.000000 | 449.000000 | 0.0 | 0.0 | 0.0 | NaN | ... | 357.000000 | 0.0 | 17436.000000 | 5836.000000 | NaN | 17430.000000 | 41390.000000 | 41015.000000 | 101.990000 | 116.800000 |
8 rows × 28 columns
Data Preprocessing (Manually Done)
Addressing Missing Values
time: object generation biomass: float64 generation fossil brown coal/lignite: float64 generation fossil coal-derived gas: float64 generation fossil gas: float64 generation fossil hard coal: float64 generation fossil oil: float64 generation fossil oil shale: float64 generation fossil peat: float64 generation geothermal: float64 generation hydro pumped storage aggregated: float64 generation hydro pumped storage consumption: float64 generation hydro run-of-river and poundage: float64 generation hydro water reservoir: float64 generation marine: float64 generation nuclear: float64 generation other: float64 generation other renewable: float64 generation solar: float64 generation waste: float64 generation wind offshore: float64 generation wind onshore: float64 forecast solar day ahead: float64 forecast wind offshore eday ahead: float64 forecast wind onshore day ahead: float64 total load forecast: float64 total load actual: float64 price day ahead: float64 price actual: float64
time 0 generation biomass 19 generation fossil brown coal/lignite 18 generation fossil coal-derived gas 18 generation fossil gas 18 generation fossil hard coal 18 generation fossil oil 19 generation fossil oil shale 18 generation fossil peat 18 generation geothermal 18 generation hydro pumped storage aggregated 35064 generation hydro pumped storage consumption 19 generation hydro run-of-river and poundage 19 generation hydro water reservoir 18 generation marine 19 generation nuclear 17 generation other 18 generation other renewable 18 generation solar 18 generation waste 19 generation wind offshore 18 generation wind onshore 18 forecast solar day ahead 0 forecast wind offshore eday ahead 35064 forecast wind onshore day ahead 0 total load forecast 0 total load actual 36 price day ahead 0 price actual 0 dtype: int64
generation hydro pumped storage aggregated and forecast wind offshore eday ahead consist entirely of NaN values. They should be dropped completely.
time 0 generation biomass 19 generation fossil brown coal/lignite 18 generation fossil coal-derived gas 18 generation fossil gas 18 generation fossil hard coal 18 generation fossil oil 19 generation fossil oil shale 18 generation fossil peat 18 generation geothermal 18 generation hydro pumped storage consumption 19 generation hydro run-of-river and poundage 19 generation hydro water reservoir 18 generation marine 19 generation nuclear 17 generation other 18 generation other renewable 18 generation solar 18 generation waste 19 generation wind offshore 18 generation wind onshore 18 forecast solar day ahead 0 forecast wind onshore day ahead 0 total load forecast 0 total load actual 36 price day ahead 0 price actual 0 dtype: int64
No. of rows with NaN values: 47. Total rows: 35,064 (47 / 35,064 = 0.13% of rows contain NaN values). Total rows if NaN rows are dropped: 35,017.
Decision: Remove NaN values completely
For this section, only 47 of the 35,064 rows were omitted; 35,017 rows remained.
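Both decisions (dropping the fully-NaN columns, then the few remaining NaN rows) can be sketched with a hypothetical drop_missing helper; the name and structure are assumptions, not the notebook's actual cell:

```python
import pandas as pd

def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    # 1) Drop columns that contain only NaN values (here: the aggregated
    #    pumped-storage and offshore day-ahead forecast columns).
    out = df.dropna(axis=1, how="all")
    # 2) Drop the remaining rows with any NaN value; they are a
    #    negligible share of the dataset (47 of 35,064 rows, ~0.13%).
    return out.dropna(axis=0, how="any")
```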
Addressing Potential Categorical and Ordinal Values
generation fossil coal-derived gas: [0.] generation fossil oil shale: [0.] generation fossil peat: [0.] generation geothermal: [0.] generation marine: [0.] generation wind offshore: [0.]
Dropped: generation fossil coal-derived gas Dropped: generation fossil oil shale Dropped: generation fossil peat Dropped: generation geothermal Dropped: generation marine Dropped: generation wind offshore
From this section, we discovered that the listed columns contain only zeros. They serve no purpose to the model whatsoever. Dropping them also resolved the initially unusual Pearson Correlation Coefficient Heatmap seen in the Data Exploration section, as the updated Heatmap below shows.
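Detecting such zero-only columns can be automated; a sketch with a hypothetical drop_constant_columns helper (any column with a single unique value carries no information):

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Columns with one unique value (all zeros here) cannot help the model
    # and distort the correlation heatmap.
    constant = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    for c in constant:
        print(f"Dropped: {c}")
    return df.drop(columns=constant)
```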
Other Potentially Unnecessary Variables
Inspecting the Heatmap of the processed dataset, we see that total load forecast, total load actual, and price day ahead are strongly correlated with price actual, because they carry forward-looking information. We should therefore also remove other columns that refer to future information, namely: forecast solar day ahead and forecast wind onshore day ahead.
The model's purpose would be defeated if we base our forecasts on forecasts.
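A sketch of this leakage removal; the column list is taken from the discussion above, and should be treated as an assumption about exactly which columns were dropped:

```python
import pandas as pd

def drop_future_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Day-ahead and forecast columns would let the model "peek" at the
    # future, defeating the purpose of forecasting.
    future_cols = [
        "total load forecast",
        "price day ahead",
        "forecast solar day ahead",
        "forecast wind onshore day ahead",
    ]
    return df.drop(columns=[c for c in future_cols if c in df.columns])
```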
Feature Engineering (Automated by Function)
Extract and Integrate DateTime Variables
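The extraction cell is not shown in this export; one plausible implementation, following the extract_datetime(df, datetime_col) signature named in the Final Pipeline Functions section, might be:

```python
import pandas as pd

def extract_datetime(df: pd.DataFrame, datetime_col: str) -> pd.DataFrame:
    # Parse the timestamp column (stored as UTC strings in this dataset)
    # and extract the calendar components used as model features.
    out = df.copy()
    dt = pd.to_datetime(out[datetime_col], utc=True)
    out["year"] = dt.dt.year
    out["month"] = dt.dt.month
    out["day"] = dt.dt.day
    out["hour"] = dt.dt.hour
    out["weekday"] = dt.dt.weekday  # Monday=0 ... Sunday=6
    return out
```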
| time | year | month | day | hour | weekday | generation biomass | generation fossil brown coal/lignite | generation fossil gas | generation fossil hard coal | ... | generation hydro pumped storage consumption | generation hydro run-of-river and poundage | generation hydro water reservoir | generation nuclear | generation other | generation other renewable | generation solar | generation waste | generation wind onshore | price actual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-12-31 23:00:00+00:00 | 2014 | 12 | 31 | 23 | 2 | 447.0 | 329.0 | 4844.0 | 4821.0 | ... | 863.0 | 1051.0 | 1899.0 | 7096.0 | 43.0 | 73.0 | 49.0 | 196.0 | 6378.0 | 65.41 |
| 1 | 2015-01-01 00:00:00+00:00 | 2015 | 1 | 1 | 0 | 3 | 449.0 | 328.0 | 5196.0 | 4755.0 | ... | 920.0 | 1009.0 | 1658.0 | 7096.0 | 43.0 | 71.0 | 50.0 | 195.0 | 5890.0 | 64.92 |
| 2 | 2015-01-01 01:00:00+00:00 | 2015 | 1 | 1 | 1 | 3 | 448.0 | 323.0 | 4857.0 | 4581.0 | ... | 1164.0 | 973.0 | 1371.0 | 7099.0 | 43.0 | 73.0 | 50.0 | 196.0 | 5461.0 | 64.48 |
| 3 | 2015-01-01 02:00:00+00:00 | 2015 | 1 | 1 | 2 | 3 | 438.0 | 254.0 | 4314.0 | 4131.0 | ... | 1503.0 | 949.0 | 779.0 | 7098.0 | 43.0 | 75.0 | 50.0 | 191.0 | 5238.0 | 59.32 |
| 4 | 2015-01-01 03:00:00+00:00 | 2015 | 1 | 1 | 3 | 3 | 428.0 | 187.0 | 4130.0 | 3840.0 | ... | 1826.0 | 953.0 | 720.0 | 7097.0 | 43.0 | 74.0 | 42.0 | 189.0 | 4935.0 | 56.04 |
5 rows × 21 columns
Integrate Lag Variables
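A sketch of the lag-variable function, assuming the create_lag(df, target_col, lag_hours) signature named in the Final Pipeline Functions section; copying the frame first avoids pandas' SettingWithCopyWarning:

```python
import pandas as pd

def create_lag(df: pd.DataFrame, target_col: str, lag_hours: int) -> pd.DataFrame:
    # Work on an explicit copy so assignments never target a view.
    out = df.copy()
    for lag in range(1, lag_hours + 1):
        out[f"lag_{lag}"] = out[target_col].shift(lag)
    # The first `lag_hours` rows have no full history; drop them.
    return out.dropna()
```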
| time | year | month | day | hour | weekday | generation biomass | generation fossil brown coal/lignite | generation fossil gas | generation fossil hard coal | ... | generation waste | generation wind onshore | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | lag_6 | lag_7 | price actual | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 2015-01-01 06:00:00+00:00 | 2015 | 1 | 1 | 6 | 3 | 408.0 | 172.0 | 4030.0 | 3208.0 | ... | 189.0 | 3992.0 | 51.73 | 53.63 | 56.04 | 59.32 | 64.48 | 64.92 | 65.41 | 51.43 |
| 8 | 2015-01-01 07:00:00+00:00 | 2015 | 1 | 1 | 7 | 3 | 413.0 | 177.0 | 4052.0 | 3335.0 | ... | 198.0 | 3629.0 | 51.43 | 51.73 | 53.63 | 56.04 | 59.32 | 64.48 | 64.92 | 48.98 |
| 9 | 2015-01-01 08:00:00+00:00 | 2015 | 1 | 1 | 8 | 3 | 419.0 | 177.0 | 4137.0 | 3437.0 | ... | 198.0 | 3073.0 | 48.98 | 51.43 | 51.73 | 53.63 | 56.04 | 59.32 | 64.48 | 54.20 |
| 10 | 2015-01-01 09:00:00+00:00 | 2015 | 1 | 1 | 9 | 3 | 422.0 | 173.0 | 4059.0 | 3516.0 | ... | 204.0 | 2683.0 | 54.20 | 48.98 | 51.43 | 51.73 | 53.63 | 56.04 | 59.32 | 58.94 |
| 11 | 2015-01-01 10:00:00+00:00 | 2015 | 1 | 1 | 10 | 3 | 421.0 | 226.0 | 3931.0 | 3845.0 | ... | 206.0 | 2771.0 | 58.94 | 54.20 | 48.98 | 51.43 | 51.73 | 53.63 | 56.04 | 59.86 |
5 rows × 28 columns
The Eight Machine Learning Models
Import the libraries
Final Pipeline Functions
This section provides a streamlined version of the Pipeline, integrating all the preceding findings and methods into the functions defined here.
process_df(df) is applicable only to this dataset. If you add more X variables for prediction, this is the code you would run before applying create_lag(df, target_col, lag_hours) and extract_datetime(df, datetime_col).
add_rolling_window_features(data, target_col, window_size) is included to ensure the integrity of the model and its results; its specifics are not discussed here.
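Since the rolling-window step is not detailed in the text, here is one leakage-safe implementation consistent with the stated signature (the shift by one keeps each window strictly in the past; treat the feature names as assumptions):

```python
import pandas as pd

def add_rolling_window_features(data: pd.DataFrame, target_col: str,
                                window_size: int) -> pd.DataFrame:
    out = data.copy()
    # Shift by 1 so each window summarizes only past values of the target.
    shifted = out[target_col].shift(1)
    out[f"rolling_mean_{window_size}"] = shifted.rolling(window_size).mean()
    out[f"rolling_std_{window_size}"] = shifted.rolling(window_size).std()
    # Rows without a full window of history are dropped.
    return out.dropna()
```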
Creating the Model Assessment: Baseline MAE Function
When assessing the model's usefulness, we use Mean Absolute Error (MAE) instead of R-squared. MAE provides a direct, interpretable measure of average error, making it better suited for assessing the practical impact of prediction errors. R-squared, being a relative measure of fit, describes how much variance in the dependent variable the independent variables explain, and can be misleading when judging time series predictive accuracy.
In this case, we use the assumption 'price today, price tomorrow', implemented as data[target_col].shift(1).
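A sketch of this naive-persistence baseline (the function name is illustrative; the shift(1) step is the one quoted above):

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

def baseline_mae(data: pd.DataFrame, target_col: str) -> float:
    # "Price today, price tomorrow": predict each hour's price with the
    # price observed one step earlier, then measure the average error.
    naive_pred = data[target_col].shift(1)
    mask = naive_pred.notna()  # the first row has no predecessor
    return mean_absolute_error(data.loc[mask, target_col], naive_pred[mask])
```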
Apply all the functions to the original dataset
This section demonstrates the streamlining of the Pipeline, in conjunction with the Final Pipeline Functions section.
Assigning X and y
Specifically for this part, we have to decide how many days ahead we want to predict. In this case, we want to predict hourly electricity prices for the next 21 days.
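A chronological split along these lines would hold out the final 21 days (names are illustrative; time series must be split in order, never shuffled):

```python
import pandas as pd

def time_split(df: pd.DataFrame, target_col: str, horizon_days: int = 21):
    # Hold out the last `horizon_days` of hourly data as the test set.
    horizon = horizon_days * 24
    X = df.drop(columns=[target_col])
    y = df[target_col]
    X_train, X_test = X.iloc[:-horizon], X.iloc[-horizon:]
    y_train, y_test = y.iloc[:-horizon], y.iloc[-horizon:]
    return X_train, X_test, y_train, y_test
```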
AutoML and Results
Results and Discussion
In order to assess the model's accuracy and reliability,
we have to employ the Mean Absolute Error (MAE), as discussed in the Methodology section. The score represents the average discrepancy of the model; the lower the MAE score, the better. In this case, the baseline is 2.522, obtained using the MAE function from sklearn.metrics. The Mean Absolute Error is defined as $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i \rvert$, where $y_i$ is the actual price, $\hat{y}_i$ the predicted price, and $n$ the number of observations.
If the MAE Test score of the model is below the Baseline MAE, then it is considered a reliable and accurate forecasting model.
The table below is ordered from most accurate to least accurate (lowest MAE to highest). It is evident that the Gradient Boosting Method scored below the Baseline MAE, by about 0.24, followed by Random Forest. The rest, while posting low MAE scores, are less reliable than the first two. kNN, in particular, again showed its complete inability to extrapolate, unlike the other models.
It is also notable that lag_1, the price from the immediately preceding hour, was the most important feature for predicting the succeeding hourly prices. This suggests that short-term history plays a crucial role in forecasting future values across all of the tested models, and indicates that the market has a strong temporal component, where recent history is a good indicator of the near future.
Baseline MAE: 2.522180515513944
| Machine Learning Regression Method | MAE Validation | MAE Test | Most Important Feature | |
|---|---|---|---|---|
| 6 | Gradient Boosting Method | 1.788075 | 2.280689 | lag_1 |
| 5 | Random Forest | 1.744870 | 2.324196 | lag_1 |
| 7 | XGBoost | 1.755337 | 2.617817 | lag_1 |
| 1 | Linear Regression + Lasso | 1.937245 | 2.642746 | lag_1 |
| 2 | Linear Regression + Ridge | 1.925822 | 2.669080 | lag_1 |
| 0 | Linear Regression | 1.925834 | 2.669098 | lag_1 |
| 4 | Decision Tree | 2.654942 | 3.098812 | lag_1 |
| 3 | kNN | 9.256880 | 10.802547 | NA |
Visualising the performance of the Machine Learning models,
we compare the predicted values with the actual values. All of the models except kNN almost mirror the actual results.
Finally, we are interested in how effective the Data Preprocessing and Feature Engineering actually is.
Suppose we subject the raw dataset only to its Preprocessing phase: how well would the ML models do? What if no Feature Engineering were involved in the first place?
To recap, we used process_df(df), create_lag(df, 'price actual', lag_hours=7), add_rolling_window_features(df, 'price actual', window_size=24), and add_baseline_prediction(df, 'price actual') for the pipeline.
For our analysis, we only subject the dataset to the process_df(df) and immediately fit it to the models.
Looking at the MAE results
of the models, using the same baseline as the original process, we see that all models failed to beat the baseline. Furthermore, they showed much larger discrepancies, with average errors ranging from roughly 7 to 10 EUR/MWh across all the models.
Baseline MAE: 2.522180515513944
| Machine Learning Regression Method | MAE Validation | MAE Test | Most Important Feature | |
|---|---|---|---|---|
| 6 | Gradient Boosting Method | 6.866867 | 7.971108 | generation fossil hard coal |
| 7 | XGBoost | 8.713311 | 8.487905 | generation fossil hard coal |
| 1 | Linear Regression + Lasso | 7.143360 | 9.389710 | generation other renewable |
| 0 | Linear Regression | 7.146463 | 9.433480 | generation other renewable |
| 2 | Linear Regression + Ridge | 7.146463 | 9.433480 | generation other renewable |
| 5 | Random Forest | 7.853145 | 10.825816 | generation fossil hard coal |
| 3 | kNN | 9.279271 | 10.848602 | NA |
| 4 | Decision Tree | 12.158149 | 15.559307 | generation fossil hard coal |
We further observe how evidently the models fail to predict without the employment of Feature Engineering,
by visualizing again the actual target values versus the predicted values.
By implementing the models without any Feature Engineering of the data, we have demonstrated how significant it is to apply these steps before subjecting the data to the Machine Learning models.
Conclusion
From the discussions, we have effectively shown how crucial Data Preprocessing and Feature Engineering are to solving Time Series problems. A comparative analysis of the presence versus absence of Time Series Feature Engineering clearly demonstrates its positive impact on model performance. This is particularly notable because the improvements were achieved without tuning the parameters of the eight evaluated models, implying the inherent efficacy of these preprocessing techniques.
Recommendation
For future applications, two pivotal factors must be prioritized: ensuring data stationarity and the practical applicability of the methodology in real-world scenarios.
Stationarity: Models often struggle with non-stationary datasets. To address this, applying Time Series Feature Engineering methods such as differencing is recommended to stabilize the dataset.
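First-order differencing can be sketched in one line with pandas (the helper name is illustrative):

```python
import pandas as pd

def difference(series: pd.Series, periods: int = 1) -> pd.Series:
    # Each value minus the previous one; removes trend so the mean of the
    # series is stable over time.
    return series.diff(periods).dropna()
```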
Real-World Applicability: In real-world scenarios, some input variables may be projections rather than observed values. This necessitates rigorous model evaluation, for example through Cross-Validation techniques tailored to Time Series data, to ensure robustness and reliability.
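scikit-learn ships such a scheme as TimeSeriesSplit; a small demonstration of its expanding-window behavior, where each fold trains on the past and validates on the block that follows:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 100 hourly samples as a stand-in for the real feature matrix.
X = np.arange(100).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = []
for train_idx, val_idx in tscv.split(X):
    # Validation indices always come strictly after all training indices,
    # so no future information leaks into training.
    assert train_idx.max() < val_idx.min()
    folds.append((len(train_idx), len(val_idx)))
```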
What to explore next?
The next phase of exploration should concentrate on models specifically designed for Time Series analysis. These models should inherently possess mechanisms to utilize outputs from previous time steps as inputs for future predictions. Such a capability, essentially creating automatic lag features, will significantly enhance the model's proficiency in handling Time Series data, making it more adept at forecasting with temporal dynamics in mind.
References
Brownlee, J. (n.d.). Feature selection for time series forecasting with Python. Machine Learning Mastery. Retrieved from https://machinelearningmastery.com/feature-selection-time-series-forecasting-python/
DotData. (n.d.). Practical guide for feature engineering of time series data. Retrieved from https://dotdata.com/blog/practical-guide-for-feature-engineering-of-time-series-data/
Jhana, N. (n.d.). Energy Consumption, Generation, Prices, and Weather Dataset. Kaggle. Retrieved from https://www.kaggle.com/datasets/nicholasjhana/energy-consumption-generation-prices-and-weather?select=energy_dataset.csv